How low can you go? Introducing SeXY: sex identification from low‐quantity sequencing data despite lacking assembled sex chromosomes

Abstract Accurate sex identification is crucial for elucidating the biology of a species. In the absence of directly observable sexual characteristics, sex identification of wild fauna can be challenging, if not impossible. Molecular sexing offers a powerful alternative to morphological sexing approaches. Here, we present SeXY, a novel sex‐identification pipeline, for very low‐coverage shotgun sequencing data from a single individual. SeXY was designed to utilize low‐effort screening data for sex identification and does not require a conspecific sex‐chromosome assembly as reference. We assess the accuracy of our pipeline to data quantity by downsampling sequencing data from 100,000 to 1000 mapped reads and to reference genome selection by mapping to a variety of reference genomes of various qualities and phylogenetic distance. We show that our method is 100% accurate when mapping to a high‐quality (highly contiguous N50 > 30 Mb) conspecific genome, even down to 1000 mapped reads. For lower‐quality reference assemblies (N50 < 30 Mb), our method is 100% accurate with 50,000 mapped reads, regardless of reference assembly quality or phylogenetic distance. The SeXY pipeline provides several advantages over previously implemented methods; SeXY (i) requires sequencing data from only a single individual, (ii) does not require assembled conspecific sex chromosomes, or even a conspecific reference assembly, (iii) takes into account variation in coverage across the genome, and (iv) is accurate with only 1000 mapped reads in many cases.


| INTRODUC TI ON
Accurate sex identification is critical for elucidating the life history, behavior, social structure, and demography of a species. It is particularly important for taxa where females and males differ in prey preference (e.g., Louis et al., 2021), social interactions and mating behavior (e.g., Amos et al., 1993;Pečnerová et al., 2017), and seasonal movements and dispersal (e.g., Dobson & Stephen Dobson, 1982;Gower et al., 2019;Greenwood, 1980). Reliable sex identification may also help to elucidate the impacts of past and present anthropogenic activities on wildlife, including prehistoric hunting or domestication practices (e.g., Nistelberger et al., 2019), and the identification of the sex of and sex biases in ongoing wildlife poaching (e.g., Malisa et al., 2005).
In the absence of directly observable sexual characteristics, such as morphology or behavior (Fairbairn et al., 2008), sex identification of wild fauna remains challenging, if not impossible. An additional challenge for research based on museum or palaeontological specimens is the sex identification of skeletal remains. In most cases, such as in the (sub-)fossil record, only small skeletal fragments are available. Osteological sex determination may also be limited by the degree of preservation, the age of the individual, or access to appropriate reference material with which to compare (Buonasera et al., 2020).
Molecular sexing can be used as an alternative to morphological sexing; it only requires a small tissue sample (Hrovatin & Kunej, 2018) and may even be applied to environmental samples (e.g., Durnin et al., 2007). Many molecular sexing techniques utilize information regarding the homogametic and heterogametic sexes. In mammals, and in many fishes, females are homogametic and males are heterogametic with XX and XY chromosomes, respectively (Ellegren, 2000;Í Kongsstovu et al., 2020;Moore, 1925). In birds and certain reptiles, the pattern is reversed, with females having ZW and males having ZZ chromosomes.
For tissue samples with high-quality DNA, molecular sex identification is relatively fast, inexpensive, and straightforward.
However, these approaches require specific laboratory work targeting loci in sex chromosomes (e.g., Ahlering et al., 2011) and are not suitable for samples with highly fragmented and/or degraded DNA, such as material not specifically sampled and preserved for DNA analysis (including skeletal remains, wildlife products, and museum specimens). PCR failure in method (i) and a biased amplification of the ZFX over the ZFY region (Sinding et al., 2016) in method (ii) may cause males to be misidentified as females.
The analysis of shotgun sequencing data offers a more robust approach to identify the sex of an individual; endogenous shotgun data can be retrieved from samples with low-quality DNA, with no additional laboratory procedures required to specifically target loci on sex chromosomes. Sex-identification pipelines for DNA data with a low number of target reads were originally developed for human ancient DNA data and were based on either the ratio of number of reads aligning to the X and Y chromosomes (Skoglund et al., 2013) or on the ratio of number of reads aligning to the X chromosome versus the autosomes (Mittnik et al., 2016). This last method has recently been utilized on elephants and other mammalian taxa for which the X chromosome of either a conspecific or a related reference genome is available (Bro-Jørgensen et al., 2021;de Flamingh et al., 2020). Although this approach has been shown to be efficient down to ~10,000 mapped sequencing reads, it requires either a conspecific chromosome-level assembly with known sex chromosomes or mapping to a more distantly related chromosome-level assembly, with decreased mapping efficiency as a result.
Reference genome assemblies from nonmodel vertebrate species with assembled sex chromosomes are relatively scarce.  (Cabrera et al., 2021;de Flamingh et al., 2020). In the absence of a conspecific chromosome-level assembly, alternative approaches can be used to identify scaffolds originating from sex chromosomes. Approaches include synteny-based, whole-genome alignments (e.g., Grabherr et al., 2010), and the estimation of relative coverage of each scaffold using data from known females and males of the target species (reviewed in Palmer et al., 2019). Sex identification using synteny or coverage approaches has been applied in some studies using ancient (e.g., Kirch et al., 2021) or degraded DNA (e.g., Skovrind et al., 2019).
However, the pipelines have been developed for specific species and datasets, and an assessment of the minimum level of required sequencing data and of the impact of reference genome assembly choice is lacking.
Methods exist that circumvent the need to a priori identify sexlinked scaffolds. For example, a recent fast and automated method "Sex Assignment Through Coverage" uses principal component analysis to identify sex-related scaffolds and the sex of an individual (Nursyifa et al., 2021). This approach holds promise for studies that include a relatively large number of samples, as the method requires a set of both male and female samples. However, these sample requirements may not always be met.
Here, we present a sex-identification method (SeXY) for taxa lacking a conspecific chromosome-level assembly. The method can be applied to shotgun sequencing data from mammals and potentially to any species with a heterogametic sex (e.g, birds and some reptiles, fish, and insects) in which the target and reference species share the same sex-determination system (i.e., same sex chromosomes, same sex determining locus, same sex determining gene). We use a synteny-based approach to identify putative X-linked scaffolds in the reference assembly and determine sex using the expectation that males (in mammals) have half the amount of X-chromosome genetic material compared with females. We assessed the robustness of this method using raw shotgun sequencing data from two target marine mammal species: beluga whale (Delphinapterus leucas) and polar bear (Ursus maritimus). The read data were subsampled and mapped to reference assemblies of various qualities and phylogenetic distances. We show our approach to be highly accurate F I G U R E 1 Schematic representation of the data sets and reference assemblies (RefGEN, RefX, RefY) analyzed for the two target species: beluga and polar bear. Each branch of the flowchart shows the evaluated combination of (a) reference genome assembly (RefGEN) used as mapping reference for the raw reads of each target species, (b) number of mapped reads of the target species (representing six independent data sets), and (c) reference sex-chromosome assembly (RefX and RefY) used to identify the sex-linked scaffolds (synteny). Total number of evaluated data sets per branch of the flow chart is shown at the bottom of the figure. (i) with as few as 1000 mapped reads when mapping to a highquality (chromosome level) reference genome assembly, or as few as 50,000 mapped reads when mapping to a lower-quality reference genome assembly (N50 < 30 million base pairs [Mb]); (ii) when using a phylogenetically distant reference genome assembly; and (iii) without known sex chromosomes.

| MATERIAL S AND ME THODS
The SeXY method requires (i) raw shotgun sequencing reads of a target individual; (ii) an assembled genome from either a conspecific or related species (which we term RefGEN) with the same sex determination system; and (iii) assembled X and Y chromosomes (which we term RefX and RefY, respectively), which can be either from the same or another species than the RefGEN.
We assessed the applicability of SeXY using data from two target species: beluga and polar bear. We also assessed the impact of reference assembly using four RefGEN of varying quality and phylogenetic distance to each target species and two reference sex chromosome assemblies (each comprising RefX and RefY) from species of varying phylogenetic distance. To ascertain the applicability of our method to specimens with low DNA yield, we additionally tested the impact of the number of mapped reads on the sex determination using various downsamplings ranging from 100,000 to 1000 mapped reads.

| Target species data and reference assemblies
We used publicly available Illumina shotgun sequencing reads from 10 beluga and 10 polar bear individuals (Table S1). Each species dataset comprised five females and five males. As we were interested in results produced with ≤100,000 mapped reads only, all read files were randomly downsampled to 1 million reads using the sample option in seqtk v1.3 (https://github.com/lh3/seqtk), to reduce computational time during the mapping step.
To evaluate the impact of reference genome assembly, we used four reference assemblies (RefGEN) for each target species (beluga, polar bear): two conspecific RefGEN of differing assembly quality, and two RefGEN from more divergent species ( Figure 2; Table 2). To reduce computational time and memory usage, all scaffolds <10 kilobase (kb) were removed from the RefGEN files and excluded from downstream analyses using reformat.sh from the BBmap toolsuite (Bushnell, 2014).
For beluga, we included two beluga reference assemblies: one of lower quality (non-chromosome-level) (Beluga v1, N50 161 kb F I G U R E 2 Sex determination of beluga and polar bear individuals using four reference genome assemblies (RefGEN), one combination of reference sex-chromosome assembly (RefX and RefY) for each target species, and various numbers of mapped reads. The ten beluga and ten polar bear individuals tested both comprised five females (red) and five males (blue). X axis shows number of mapped reads (square) and average number of raw reads necessary to obtain the required number of mapped reads (triangle). Y axis shows comparison of X chromosome and autosome coverage (X:A ratio) for each combination of RefGEN, RefX and RefY (CowX and HumanY, DogX and DogY), and number of mapped reads. Individuals were determined as females if their X:A ratio was ≥0.8, and as males if their X:A ratio was ≤0.7. Grey shaded horizontal bars indicate an X:A ratio of 0.7-0.8, which we interpreted as undetermined sex.
TA B L E 1 Summary table showing percentage of correct sex determination across tested combinations of reference genome assembly (RefGEN), reference sex-chromosome assembly (RefX and RefY), and number of mapped reads. Results are shown for the beluga data and the cetacean/cow RefGEN assemblies tested (left columns) and for the polar bear data and the bear/dog RefGEN assemblies tested (right columns). The value below each RefGEN indicates the assembly N50. For cells with two estimates, the left value indicates estimates including both incorrectly determined and undetermined sex, and the right value indicates estimates including incorrectly determined sex only (excluding undetermined sex). Only one value is included if both estimates were the same. Percentages in each cell are based on 10 sample individuals: five females and five males. Sex determination for each indvidual was calculated using the average value of 10 replicates. Individuals were determined as females if their X:A ratio was ≥ 0.8, and as males if their X:A ratio was ≤0.7. We interpreted an X:A ratio of 0.7-0.8 as undetermined sex. Corresponding summary table for tests using HumanX and HumanY as RefX and RefY, respectively, is provided in Table S7.  [Jones et al., 2017]) and one highly contiguous (non-chromosomelevel) (Beluga v3, N50 31 Mb [Dudchenko et al., 2017[Dudchenko et al., , 2018). We also included a relatively low-quality killer whale (Orcinus orca) assembly (Orca, N50 13 Mb [Foote et al., 2015]) and a chromosomelevel cow assembly (Cow, N50 103 Mb [Zimin et al., 2009] (Hu et al., 2017) and an annual mutation rate for polar bear of 1.6 × 10 −9 (Liu et al., 2014). The divergence between the polar bear and dog genomes is estimated at ~17%, assuming a divergence time of ~52 Ma (Hu et al., 2017) and the abovementioned polar bear mutation rate.

| Identification of putative sex-linked and autosomal scaffolds
We identified scaffolds putatively originating from sex chromosomes (both X and Y) from all RefGEN lacking assembled sex chromosomes as well as from Cow and Dog, which include assembled sex chromosomes. We did this by aligning each RefGEN with a designated pair of RefX and RefY assemblies, using satsuma synteny v2.1 (Grabherr et al., 2010) with default parameters (Figure 1). To increase efficiency and only run the synteny analysis once, we concatenated the RefX and RefY assemblies in one file.
Although our method relies on comparing X chromosome and autosomal coverage (which we term X:A ratio), we included the Y chromosome to remove possible biases due to pseudoautosomal regions (homologous regions between the X and Y chromosomes) (Helena Mangs & Morris, 2007). To reduce this bias, we removed any overlapping coordinates between the X-and Y-linked scaffold bed files using bedtools v.2.29.0 intersect (Quinlan & Hall, 2010). We identified putative autosomal scaffolds by removing the previously identified putative sex-linked scaffolds from each RefGEN.
The human sex chromosome assemblies were selected as they are the most well-assembled mammalian sex chromosomes available.
We selected the cow and dog sex-chromosome assemblies, as they each represent the highest-quality, chromosome-level assemblies with defined sex chromosomes within the same phylogenetic order as each of our target species: beluga (Artiodactyla) and polar bear (Carnivora). For the cow, we used HumanY as there was no cow Ychromosome available. We used the three RefX and RefY combinations to assess the influence of phylogenetic distance to the target species on downstream sex determination. For the cetacean/cow RefGEN dataset used for beluga, combinations (i) and (ii) were used ( Figure 1a). For the bear/dog RefGEN dataset used for polar bear, combinations (i) and (iii) were used (Figure 1b)

| Mapping and downsampling of mapped reads
Processing and mapping of raw beluga and polar bear sequencing reads to each designated RefGEN (Figure 1a)  should not include the mitochondrial genome. In our case, only in the low-quality assembly Beluga v1 was the information regarding the mitochondrial genome not specified. In case the information is not specified, or the mitochondrial genome is included in the RefGEN, it is possible to first map the reads to a mitochondrial genome and exclude those mapped reads.
To evaluate the impact of number of mapped reads on genetic sex determination, we randomly downsampled the bam files to 100,000; 50,000; 10,000; 5000; 2500, and 1000 mapped reads ( Figure 2) using BBMap (Bushnell, 2014). We evaluated the differences in the mapping efficiency to each RefGEN, measured as the number of raw reads required to obtain a specific number of mapped reads ( Figure 2, Figure S1, and Table S4).

| Sex determination
The sex of each individual was estimated based on the X chromosome:autosome coverage ratio (X:A ratio). We calculated the read depth of all sites from the X-linked scaffolds and from the autosomal scaffolds using SAMtools depth v.1.9 (Li et al., 2009), specifying minimum base and mapping qualities of 25. To take into account variation across genomic regions, we randomly selected 10 million sites from both X-linked and autosomal scaffolds independently, calculated the average coverage for those sites, and calculated the X:A ratio from the average coverages. This step was repeated 10 times (Table S5). As female mammals have two copies of the X chromosome, and males carry only one copy, we expected X:A ratios of ~1 and ~0.5 for females and males, respectively. We determined a female as correctly identified if the mean X:A ratio of the 10 replicates was ≥0.8 and a male if the mean X:A ratio of the 10 replicates was ≤0.7. We considered a X:A ratio of 0.7-0.8 as "undetermined" sex as used in previous studies Mittnik et al., 2016).
When interpreting the accuracy of the method, we considered the number of (i) correctly determined sex, (ii) "undetermined" sex, and (iii) incorrectly determined sex (Table S6). We did this to indicate whether accuracy below 100% was due to individuals with undetermined sex (with a X:A ratio of 0.7-0.8) or due to individuals with incorrectly determined sex, as the latter is more detrimental to biological inference than simply the inability to determine sex.

| Sex determination
We found the sexing approach implemented in SeXY provided 100% accuracy in sex determination across all combinations of reference genome assembly (RefGEN) and reference sex-chromosome assemby (RefX, RefY), when 100,000 and 50,000 mapped reads were available ( Figure 2, Table 1, Figure S1, Tables S6 and S7). Moreover, 100% accuracy was observed for most trials involving lower numbers of mapped reads; 10,000 and 5000. Clear exceptions could be seen when using Beluga v1 (N50 161 kb) and Orca (N50 13 Mb) as RefGEN in the beluga dataset. Inaccuracies were especially prevalent when the low-quality Beluga v1 RefGEN (N50 161 kb) was used; we found a marked decline in accuracy when using ≤10,000 mapped reads, with sex determination accuracy in some cases equivalent to random chance (down to 50%) ( Table 1).
Taken together, our results showed scaffold contiguity of the RefGEN influences the accuracy of sex determination more than phylogenetic distance. Across all trials, we found the highest percentage of correctly identified sex was obtained with highly contiguous (Beluga v3) or chromosome-level (Polar bear v1 HiC, Panda, Dog) RefGEN, regardless of whether the RefGEN was from a conspecific or a more divergent species (Table 1, Figure 2). (Table 1 and Figure 1), we found 100% accuracy in sex determination down to 10,000 mapped reads when using the higher-quality Beluga v3 (N50 31 Mb) and Cow (N50 103 Mb) RefGENs (Table 1). When we decreased the number of mapped reads below 5000, we obtained a 10%-20% decrease in accuracy, which resulted in some undetermined individuals. However, for the trials where we were able to determine sex, the sex was determined with 100% accuracy down to 1000 and 2500 mapped reads with Beluga v3 and Cow as RefGEN,

respectively.
When analyzing the polar bear dataset and DogX and DogY We also tested whether the two combinations of RefX and

| DISCUSS ION
Many biological specimens for which sex cannot be identified using morphology or other traditional approaches, such as fecal, environmental, and archeological or palaeontological material, are also likely to contain highly contaminated and/or degraded DNA (Hrovatin & Kunej, 2018). SeXY was designed to utilize low-effort screening data for sex identification. Therefore, by assessing the reliability of SeXY to various levels of sequencing effort, we evaluate its applicability to such samples. Although our results differed between reference genomes, we show that less than 5000 mapped reads can be used to accurately identify the biological sex of an individual, depending on the quality of the mapping ref- erence. This finding opens a world of possibility for studies that employ low-effort shotgun sequencing approaches to identify specimens of sufficient preservation for deeper sequencing, but which discard any data/specimens not deemed of sufficient quality. By utilizing our method, sequence information that would previously have been discarded can now be used to obtain sex-related evolutionary and biological insights. Although this has been done on several taxa (e.g., Gower et al., 2019;Pečnerová et al., 2017), our method, which does not require a priori sex-chromosome information from the target species or a reference panel of known females and males, will hopefully enable such analyses from a much wider range of species. Although only tested with up to 100,000 mapped reads, the increasing accuracy as the number of mapped reads increased means this method is also suitable for well-preserved specimens with more available sequencing data. In such cases, data could even be downsampled to increase computational speed.

| Evaluation of synteny approach
SeXY identifies sex-linked scaffolds using a synteny approach (Grabherr et al., 2010), where the reference sex-chromosome assemblies (RefX and RefY) of a chromosome-level assembly from a closely related species is used to identify sequence similarities on the refer-

| Number of mapped reads
Our finding of 100% accurate sex identification when mapping polar bear reads to the dog as RefGEN, even with only 1000 mapped reads, was somewhat unexpected, as we anticipated a decline in mapping efficiency with increasing phylogenetic distance (Prasad et al., 2021). However, these results become less surprising when considering the mapping efficiency to each RefGEN. Although sex determination was 100% accurate down to 1000 mapped reads when using these two species with ~17% divergence, approximately four times as many raw reads are required to reach the target number of mapped reads, relative to when mapping to a conspecific RefGEN ( Figure 2, Table S4). Therefore, when <5000 endogenous reads are available, it is important to weigh the number of mapped reads versus the number of raw reads, to evaluate whether mapping to a conspecific reference genome or a phylogenetically distant reference genome is more beneficial. Although not tested here, alterations in mapping quality filters may facilitate the recovery of more mapped reads and thereby more accurate sex identification. However, decreased mapping quality may also result in misalignments, biasing results. Such low endogenous read counts are unlikely to arise when sequencing DNA from well-preserved samples, but it is much more common when considering highly degraded samples such as fecal, environmental, or subfossil material.

| Quality and phylogenetic distance of the reference genome assembly
When comparing results produced by mapping beluga reads to the more fragmented Beluga v1 versus the more contiguous Beluga v3, we show the quality of the reference genome assembly can significantly impact the accuracy of sex determination. The two beluga assemblies are vastly different in quality, with scaffold N50s of 161 kb and 31 Mb, respectively. When considering <50,000 mapped reads, the more fragmented Beluga v1 assembly could not be used to accurately determine sex. A fragmented reference genome assembly of lower quality, as with Beluga v1, may lead to difficulties in accurately identifying the sex-linked scaffolds, which our method is reliant on.
Therefore, although not comprehensively investigated here, it is advisable to rather use a high-quality reference genome assembly from a phylogenetically more distant species, than a low-quality conspecific assembly. However, the accuracy of the X:A ratio using Beluga v1 as mapping reference provided 100% accuracy at 50,000 and above mapped reads. Therefore, we show that SeXY can still be used to accurately identify sex even if only a highly fragmented assembly is available, if the number of mapped reads is sufficiently high. This holds promise for the applicability of our method moving forward, as there are an increasing number of high-quality reference genomes available, and initiatives such as the Vertebrate Genome project aim to generate near error-free reference genome assemblies of many vertebrate species in the near future (Rhie et al., 2021).
Phylogenetic distance of the mapping reference genome assembly also appears to play a role. In the case of beluga mapped to the Orca RefGEN, comparisons using <10,000 mapped reads were unable to accurately identify an individual's sex. However, this finding may reflect the more fragmented assembly of the Orca (N50 = 13 Mb) relative to the other mapping references, as we were able to identify sex with 80% accuracy (89% excluding undetermined sex) using Cow as RefGEN down to 1000 mapped reads.
Furthermore, while Panda as RefGEN produced less consistent results for the polar bear than the two conspecific reference genome assemblies, the Panda results were far more consistent than when Orca was used as RefGEN for beluga, perhaps owing to the higher assembly quality of the Panda (N50 = 129 Mb). Thus, our results suggest that the quality of the reference genome assembly is far more important than phylogenetic distance between the species of interest and the mapping reference.

| Recommendations and suggested guidelines
When relatively high numbers of reads are available (>50,000 mapped reads), our results show that both the fragmentation of the assembly and phylogenetic distance to the target species do not influence the accuracy of our method. Therefore, in cases such as these, the choice of RefGen is at the discretion of the user.
However, for lower-quality samples with fewer reads mapping (<50,000 mapped reads), more discretion is required. Based on our results, the level of fragmentation of the RefGen is most important here. We found that a fragmented reference genome assembly, as with Beluga v1 (N50 161 kb), may lead to difficulties in accurately identifying the sex-linked scaffolds. Based on the clear impact of genome quality and a lack of clear impact of the phylogenetic distance between the target species and the mapping reference, we recommend using a more distant reference genome, if the quality of the closest reference genome is low and only relatively few mapped reads are available.
In conclusion, we demonstrate the method implemented in SeXY can accurately determine the sex of individuals based on very low sequencing effort and when no conspecific chromosome-level assembly is available. The SeXY pipeline provides several advantages over previously implemented methods: SeXY (i) requires data from only a single individual (a mix of female and male individuals is not required), (ii) does not require assembled conspecific sex chromosomes, or even a conspecific reference assembly, (iii) takes into account variation in coverage across the genome when calculating the X:A ratio, and (iv) can work on very low-coverage shotgun data, down to 1000 mapped reads in many cases. Although we assessed the method based on XY sex chromosomes (as in mammals), the method can in theory be applied to any species with a heterogametic and a homogametic sex (e.g, birds, and some reptiles, fish, and insects) and for which the target and reference species share the same sex determination system.

CO N FLI C T O F I NTE R E S T
The authors do not have any conflict of interest to disclose.

DATA AVA I L A B I L I T Y S TAT E M E N T
All raw data used in this paper are publicly available in GenBank, NCBI, and DNA Zoo. Accession codes can be found within the Materials and Methods and Tables S1-S3. SeXY pipeline is available  (5), 831-834. Amos, B., Schlötterer, C., & Tautz, D. (1993). Social-structure of pilot whales revealed by analytical DNA profiling. Science, 260 (5108), 670-672. Bérubé, M., & Palsbøll, P. (1996). Identification of sex in cetaceans by multiplexing with three ZFX and ZFY specific primers. Molecular Ecology, 5(2), 283-287.